17 research outputs found

    A plea for more interactions between psycholinguistics and natural language processing research

    A new development in psycholinguistics is the use of regression analyses on tens of thousands of words, known as the megastudy approach. This development has led to the collection of processing times and subjective ratings (of age of acquisition, concreteness, valence, and arousal) for most of the existing words in English and Dutch. In addition, a crowdsourcing study in the Dutch language has resulted in information about how well 52,000 lemmas are known. This information is likely to be of interest to NLP researchers and computational linguists. At the same time, large-scale measures of word characteristics developed in the latter traditions are likely to be pivotal in bringing the megastudy approach to the next level.
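    As a rough illustration of the megastudy approach described above, the sketch below regresses per-word response times on a few common lexical predictors. The file and column names (megastudy_items.csv, rt, log_frequency, length, aoa, concreteness) are hypothetical placeholders, not the datasets referred to in the abstract.

        # Minimal sketch of a megastudy-style item-level regression.
        # File and column names are hypothetical placeholders.
        import pandas as pd
        import statsmodels.formula.api as smf

        items = pd.read_csv("megastudy_items.csv")  # one row per word

        # Predict mean response time from frequency, length, and ratings.
        model = smf.ols("rt ~ log_frequency + length + aoa + concreteness",
                        data=items).fit()
        print(model.summary())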

    Corpus linguistics

    The first comprehensive guide to research methods and technologies in psycholinguistics and the neurobiology of language. Bringing together contributions from a distinguished group of researchers and practitioners, editors Annette M. B. de Groot and Peter Hagoort explore the methods and technologies used by researchers of language acquisition, language processing, and communication, including traditional observational and behavioral methods, computational modelling, corpus linguistics, and virtual reality. The book also examines neurobiological methods, including functional and structural neuroimaging and molecular genetics. Ideal for students engaged in the field, Research Methods in Psycholinguistics and the Neurobiology of Language explains the underlying assumptions and rationales of each method, examines its relative strengths and weaknesses in relation to competing approaches, and describes the apparatus involved, the nature of the stimuli and data used, and the data collection and analysis techniques. Featuring numerous example studies, along with many full-color illustrations, this indispensable text will help readers gain a clear picture of the practices and tools described.

    Which words do English non-native speakers know? New supernational levels based on yes/no decision

    To gather more information about the English words known by second language (L2) speakers, we ran a large-scale crowdsourcing vocabulary test, which yielded 17 million useful responses. It provided us with a list of 445 words known to nearly all participants. The list was compared to various existing lists of words recommended for inclusion in the first stages of English L2 teaching. The data also provided us with a ranking of 61,000 words in terms of degree and speed of word recognition in English L2 speakers, which correlated r = .85 with a similar ranking based on native English speakers. The L2 speakers in our study were relatively better at academic words (which are often cognates in their mother tongue) and words related to experiences English L2 students are likely to have. They were worse at words related to childhood and family life. Finally, a new list of 20 levels of 1,000 word families is presented, which will be of use to English L2 teachers, as the levels represent the order in which English vocabulary seems to be acquired by L2 learners across the world.

    Recognition times for 62 thousand English words: data from the English Crowdsourcing Project

    We present a new dataset of English word recognition times for a total of 62 thousand words, called the English Crowdsourcing Project. The data were collected via an internet vocabulary test in which more than one million people participated. The present dataset is limited to native English speakers. Participants were asked to indicate which words they knew. Their response times were registered, although at no point were the participants asked to respond as quickly as possible. Still, the response times correlate around .75 with the response times of the English Lexicon Project for the shared words. Also, the results of virtual experiments indicate that the new response times are a valid addition to the English Lexicon Project. This means not only that we have useful response times for some 35 thousand extra words, but also that we now have data on differences in response latencies as a function of education and age.
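    The shared-word comparison mentioned above (a correlation of around .75 with the English Lexicon Project) can be computed along the lines sketched below; the file and column names are assumptions for illustration only.

        import pandas as pd

        # Hypothetical file and column names; one row per word in each table.
        ecp = pd.read_csv("ecp_recognition_times.csv")    # columns: word, rt
        elp = pd.read_csv("elp_lexical_decision.csv")     # columns: word, rt

        # Keep only the words present in both datasets, then correlate.
        shared = ecp.merge(elp, on="word", suffixes=("_ecp", "_elp"))
        r = shared["rt_ecp"].corr(shared["rt_elp"])       # Pearson r
        print(f"{len(shared)} shared words, r = {r:.2f}")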

    How useful are corpus-based methods for extrapolating psycholinguistic variables?

    Subjective ratings for age of acquisition, concreteness, affective valence, and many other variables are an important element of psycholinguistic research. However, even for well-studied languages, ratings usually cover just a small part of the vocabulary. A possible solution involves using corpora to build a semantic similarity space and applying machine learning techniques to extrapolate existing ratings to previously unrated words. We conduct a systematic comparison of two extrapolation techniques, k-nearest neighbours and random forest, in combination with semantic spaces built using latent semantic analysis, a topic model, a hyperspace analogue to language (HAL)-like model, and a skip-gram model. A variant of the k-nearest neighbours method used with skip-gram word vectors gives the most accurate predictions, but the random forest method has the advantage of being able to easily incorporate additional predictors. We evaluate the usefulness of the methods by exploring how much of the human performance in a lexical decision task can be explained by extrapolated ratings for age of acquisition, and how precisely we can assign words to discrete categories based on extrapolated ratings. We find that at least some of the extrapolation methods may introduce artefacts into the data and produce results that could lead to different conclusions than would be reached based on the human ratings. From a practical point of view, the usefulness of ratings extrapolated with the described methods may be limited.
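    A minimal sketch of the best-performing setup described above (k-nearest neighbours over skip-gram word vectors) is given below. The vector file, ratings file, choice of k, and distance weighting are illustrative assumptions, not the authors' exact configuration.

        # Sketch: extrapolate ratings with k-nearest neighbours in a
        # skip-gram semantic space. File names and k are assumptions.
        import numpy as np
        import pandas as pd
        from gensim.models import KeyedVectors
        from sklearn.neighbors import KNeighborsRegressor

        vectors = KeyedVectors.load_word2vec_format("skipgram.bin", binary=True)
        ratings = pd.read_csv("aoa_ratings.csv")          # columns: word, rating

        # Training data: words that have both a vector and a human rating.
        rated = ratings[ratings["word"].isin(vectors.key_to_index)]
        X = np.vstack([vectors[w] for w in rated["word"]])
        y = rated["rating"].to_numpy()

        # Distance-weighted kNN with cosine distance in the semantic space.
        knn = KNeighborsRegressor(n_neighbors=20, weights="distance",
                                  metric="cosine")
        knn.fit(X, y)

        # Predict ratings for words that have a vector but no human rating.
        unrated = [w for w in vectors.key_to_index
                   if w not in set(ratings["word"])]
        predictions = knn.predict(np.vstack([vectors[w] for w in unrated]))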
